Tables to LaTeX: structure and content extraction from scientific tables
Authors
Abstract
Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve an exact match accuracy of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
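To make the evaluation terms in the abstract concrete, here is a minimal Python sketch of an exact match accuracy over LaTeX token sequences, plus a naive row and column count for a tabular body. It is only an illustration under stated assumptions: the tokenizer, the function names (`tokenize_latex`, `exact_match_accuracy`, `table_shape`), and the whitespace-insensitive comparison are hypothetical and are not taken from the paper, whose actual tokenization and metric definitions are not given on this page.

```python
import re


def tokenize_latex(source: str) -> list[str]:
    """Split LaTeX source into commands, escaped characters, words, and symbols.

    Hypothetical tokenizer for illustration; not the paper's vocabulary.
    """
    return re.findall(r"\\[A-Za-z]+|\\.|[A-Za-z0-9.]+|\S", source)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        tokenize_latex(p) == tokenize_latex(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


def table_shape(tabular_body: str) -> tuple[int, int]:
    """Return (rows, columns) for the body of a simple tabular environment.

    Assumes rows end with a double backslash and cells are separated by
    unescaped '&'. Spanning cells (\\multicolumn, \\multirow) and rule
    commands such as \\hline are NOT handled by this naive count.
    """
    rows = [row for row in tabular_body.split(r"\\") if row.strip()]
    columns = max((len(re.split(r"(?<!\\)&", row)) for row in rows), default=0)
    return len(rows), columns


if __name__ == "__main__":
    predicted = [r"a & b \\", r"a & c \\"]
    reference = [r"a  &  b \\", r"a & b \\"]
    print(exact_match_accuracy(predicted, reference))
    print(table_shape(r"a & b \\ c & d \\"))
```

Because a single wrong token makes the whole prediction count as a miss, exact match is a strict metric, which is one way to read the gap between the structure and content scores reported in the abstract.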
Similar resources
Semi-automatic Data Extraction from Tables
This paper describes a novel approach to automating the extraction of useful information from tables and recording the knowledge obtained in a structured data repository. The approach is based on modeling the behavior of an expert who collects tabular data and maps them to a predefined relational schema. Experimental results demonstrate that the proposed approach predicts expert decisions with high ac...
From Tables to Frames
Turning the current Web into a Semantic Web requires automatic approaches for annotation of existing data since manual approaches will not scale in general. We here present an approach for automatic generation of frames out of tables which subsequently supports the automatic population of ontologies from table-like structures. The approach consists of a methodology, an accompanying implementati...
Disentangling the Structure of Tables in Scientific Literature
Within the scientific literature, tables are commonly used to present factual and statistical information in a compact way that is easy for readers to digest. The ability to "understand" the structure of tables is key for information extraction in many domains. However, the complexity and variety of presentation layouts and value formats make it difficult to automatically extract roles and re...
Plain Answers to Several Questions about Association/Independence Structure in Complete/Incomplete Contingency Tables
In this paper, we develop some results based on the relational model (Klimova et al. 2012), which permits a decomposition of the logarithm of expected cell frequencies under a log-linear type model. These results imply plain answers to several questions in the context of analyzing contingency tables. Moreover, determination of the design matrix and hypothesis-induced matrix of the model will be discusse...
Journal
Journal title: International Journal on Document Analysis and Recognition
Year: 2022
ISSN: 1433-2833, 1433-2825
DOI: https://doi.org/10.1007/s10032-022-00420-9